Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification Systems
نویسندگان
چکیده
We present initial results from an international and multi-disciplinary research collaboration that aims at the construction of a reference corpus of web genres. The primary application scenario for which we plan to build this resource is the automatic identification of web genres. Web genres are rather difficult to capture and to describe in their entirety, but we plan for the finished reference corpus to contain multi-level tags of the respective genre or genres a web document or a website instantiates. As the construction of such a corpus is by no means a trivial task, we discuss several alternatives that are, for the time being, mostly based on existing collections. Furthermore, we discuss a shared set of genre categories and a multi-purpose tool as two additional prerequisites for a reference corpus of web genres.
منابع مشابه
The System of Engagement in a Sample of Prose Fiction and the News
Emerging within Systemic Linguistics, Appraisal/Evaluation is a framework for analyzing the language of evaluation, providing techniques for the systematic analysis of evaluation and stance as they operate in whole texts and in groupings of texts. There are three systems in the Appraisal framework: Attitude, Engagement, and Graduation. This study sets out to analyze the use of the system of Eng...
متن کاملTowards a reference corpus of web genres
Genres of spoken and written texts are being intensively studied from various angles, e.g., communication studies, discourse analysis, computational linguistics, without arriving at a generally accepted definition. Many corpora have been built to represent the language, but very few large corpora indicate genres, and when they do the typology of genres varies widely. For instance, the Brown cor...
متن کاملTowards Automatic Web Genre Identification: A Corpus-Based Approach in the Domain of Academia by Example of the Academic's Personal Homepage
We argue for a systematic analysis of one particular, well structured domain—academic Web pages—with regard to a special class of digital genres: Web genres. For this purpose, we have developed a database-driven system that will ultimately consist of more than 3 000 000 HTML documents, written in German, which are the empirical basis for our research. We introduce the notions of Web genre type ...
متن کاملThe Next Step for Multi-Document Summarization: A Heterogeneous Multi-Genre Corpus Built with a Novel Construction Approach
Research in multi-document summarization has focused on newswire corpora since the early beginnings. However, the newswire genre provides genre-specific features such as sentence position which are easy to exploit in summarization systems. Such easy to exploit genre-specific features are available in other genres as well. We therefore present the new hMDS corpus for multi-document summarization...
متن کاملThe Impact of Noise in Web Genre Identification
Genre detection of web documents fits an open-set classification task. The web documents not belonging to any predefined genre or where multiple genres co-exist is considered as noise. In this work we study the impact of noise on automated genre identification within an open-set classification framework. We examine alternative classification models and document representation schemes based on t...
متن کامل